Polycot-data-2020.csv focusing on chi-square testReading in the data (had to remove manually the ‘#’ sign from the column and save as csv file):
dat = read.table("../data/Polycot-data-2020.csv", sep=",",header=TRUE)
str(dat)
## 'data.frame': 5585 obs. of 2 variables:
## $ Cotyledons: int 2 2 2 3 2 2 2 2 3 2 ...
## $ Generation: chr "P" "P" "P" "P" ...
head(dat)
## Cotyledons Generation
## 1 2 P
## 2 2 P
## 3 2 P
## 4 3 P
## 5 2 P
## 6 2 P
summary(dat)
## Cotyledons Generation
## Min. :2.000 Length:5585
## 1st Qu.:2.000 Class :character
## Median :3.000 Mode :character
## Mean :2.595
## 3rd Qu.:3.000
## Max. :4.000
dat = within(dat, Generation<-factor(Generation, levels=c("P","O")))
Creating categorical variable: dycots/polycots
dat2 = data.frame(generation=dat$Generation)
dat2$cots = "dicot"
dat2$cots[dat$Cotyledons > 2] = "polycot"
dat2 = within(dat2, cots<-factor(cots))
summary(dat2)
## generation cots
## P:2556 dicot :2437
## O:3029 polycot:3148
#write.table(dat2,file="data/polycot-data-2020-categories.csv", row.names = FALSE, sep=",")
dt = table(dat2$cots, dat2$generation)
library(graphics)
mosaicplot(dt, shade = TRUE, las=2, main="Cotelydons")
library(ggmosaic)
ggplot(data = dat2) +
geom_mosaic(aes(x=product(cots), fill=generation))
dt
##
## P O
## dicot 1612 825
## polycot 944 2204
ct = chisq.test(dt)
obs = ct$observed
exp = ct$expected
df = rbind(obs[1,],exp[1,],obs[2,],exp[2,],colSums(obs))
rownames(df) = c("dycot observed", "dycot expected", "polycot observed", "polycot expected", "total")
r = rowSums(obs)
ss = sum(r)
totals = c(r[1],NA,r[2],NA,ss)
cbind(df,totals)
## P O totals
## dycot observed 1612.000 825.000 2437
## dycot expected 1115.304 1321.696 NA
## polycot observed 944.000 2204.000 3148
## polycot expected 1440.696 1707.304 NA
## total 2556.000 3029.000 5585
ct$statistic
## X-squared
## 722.1475
ct$p.value
## [1] 4.567529e-159
## expected values
2437*2556/5585
## [1] 1115.304
2437*3029/5585
## [1] 1321.696
3148*2556/5585
## [1] 1440.696
3148*3029/5585
## [1] 1707.304
Polycot-data-2020.csvThe code of this section contains plots and statistical tests that were not included in the final version of the webinar. For the webinar, we decided to focus on the chi-square test. Here, we fit a t test, a Levene test (for equality of variances) and a Wilcoxon-Mann-Whitney test which is similar to the t test when normality is violated. We also fit an ANOVA test and its alternative when normality is not met, the Kruskal-Wallis rank sum test.
Reading in the data (had to remove manually the ‘#’ sign from the column and save as csv file):
dat = read.table("../data/Polycot-data-2020.csv", sep=",",header=TRUE)
str(dat)
## 'data.frame': 5585 obs. of 2 variables:
## $ Cotyledons: int 2 2 2 3 2 2 2 2 3 2 ...
## $ Generation: chr "P" "P" "P" "P" ...
head(dat)
## Cotyledons Generation
## 1 2 P
## 2 2 P
## 3 2 P
## 4 3 P
## 5 2 P
## 6 2 P
summary(dat)
## Cotyledons Generation
## Min. :2.000 Length:5585
## 1st Qu.:2.000 Class :character
## Median :3.000 Mode :character
## Mean :2.595
## 3rd Qu.:3.000
## Max. :4.000
dat = within(dat, Generation<-factor(Generation, levels=c("P","O")))
Standard t-test comparing:
Assumptions:
Plotting the densities:
Not really normal!
x = dat$Cotyledons[dat$Generation == "P"]
y = dat$Cotyledons[dat$Generation == "O"]
Variances look similar:
var(x)
## [1] 0.2418511
var(y)
## [1] 0.2789809
But let’s do a Levene test:
leveneTest(Cotyledons ~ Generation, dat, center=mean)
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 1 45.032 2.13e-11 ***
## 5583
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is less than 0.05, so we reject the null hypothesis of equal variances.
The Levene test also assumes Normality. So, we can run the alternative Fligner-Killeen test with the same conclusion (equal variances):
fligner.test(Cotyledons ~ Generation, dat)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: Cotyledons by Generation
## Fligner-Killeen:med chi-squared = 15.035, df = 1, p-value = 0.0001055
Back to the t-test assumptions:
If we ignore the violations and we run a standard t-test, we get:
t.test(x,y, var.equal=TRUE)
##
## Two Sample t-test
##
## data: x and y
## t = -29.717, df = 5583, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4354871 -0.3815863
## sample estimates:
## mean of x mean of y
## 2.373239 2.781776
The p-value is less than 0.05, so we reject the null hypothesis of equality of means. This implies that we have evidence that the number of cotyledons is significantly larger in the offspring than in the parents.
Our justification to do a standard t-test is that if the sample size is moderately large, the two-sample t-test is robust to non-normality due to the central limit theorem.
We can also run the Welch version of t test that does not assume equal variances (this is the default in R):
t.test(x,y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = -29.897, df = 5529.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4353249 -0.3817485
## sample estimates:
## mean of x mean of y
## 2.373239 2.781776
Alternative to the t-test when normality is violated. This test does not test equality of means, but equality of distributions, so the conclusion could be deceiving (that is, two different distributions with the same mean will have a rejected null hypothesis using WMW test).
wilcox.test(Cotyledons ~ Generation, dat)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Cotyledons by Generation
## W = 2417650, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
The p-value less than 0.05 allows us to also reject the null hypothesis of equality of distributions which provides evidence that the number of cotyledons on the offspring group is significantly higher than in the parent group.
Polycot-data-from-NY.csv (not publicly available)The following sections have code that was written before the final topics of the webinar were decided. The code below is only kept for internal book-keeping as the data is not made publicly available (this is not the data used in the webinar). Unlike the main section above, here we do not fit a chi-square test, but a t test, a Levene test (for equality of variances) and a Wilcoxon-Mann-Whitney test which is similar to the t test when normality is violated. We also fit an ANOVA test and its alternative when normality is not met, the Kruskal-Wallis rank sum test.
Reading in the data (had to remove manually the ‘#’ sign from the column and save as csv file):
dat = read.table("../data/Polycot-data-from-NY.csv", sep=",",header=TRUE)
str(dat)
## 'data.frame': 803 obs. of 4 variables:
## $ plant.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ cotyledons: int 2 2 2 2 2 2 2 2 2 2 ...
## $ generation: chr "p" "p" "p" "p" ...
## $ parents : chr "" "" "" "" ...
The data has 4 columns:
head(dat)
## plant.ID cotyledons generation parents
## 1 1 2 p
## 2 2 2 p
## 3 3 2 p
## 4 4 2 p
## 5 5 2 p
## 6 6 2 p
dat[200:210,]
## plant.ID cotyledons generation parents
## 200 200 2 O 2x2
## 201 201 2 O 2x2
## 202 202 2 O 2x2
## 203 203 2 O 2x2
## 204 204 2 O 2x2
## 205 205 2 O 2x2
## 206 206 2 O 2x2
## 207 207 2 O 2x2
## 208 208 2 O 2x2
## 209 209 2 O 2x2
## 210 210 2 O 2x2
We can summarize basic counts. For example, we have 622 offspring plants and 181 parent plants. On average, the plants have 2.691 cotelydons (mean=2.691). The median being 3 cotelydons means that 50% of the plants have 3 cotelydons or fewer
summary(dat)
## plant.ID cotyledons generation parents
## Min. : 1.0 Min. :2.000 Length:803 Length:803
## 1st Qu.:201.5 1st Qu.:2.000 Class :character Class :character
## Median :402.0 Median :3.000 Mode :character Mode :character
## Mean :399.2 Mean :2.691
## 3rd Qu.:602.5 3rd Qu.:3.000
## Max. :783.0 Max. :5.000
summary(dat$parents)
## Length Class Mode
## 803 character character
## Warning: Groups with fewer than two data points have been dropped.
## Warning: Groups with fewer than two data points have been dropped.
## Warning: Groups with fewer than two data points have been dropped.
## Warning: Groups with fewer than two data points have been dropped.
## Warning: Groups with fewer than two data points have been dropped.
## Warning: Groups with fewer than two data points have been dropped.
## Warning: Groups with fewer than two data points have been dropped.
## Warning: Groups with fewer than two data points have been dropped.
Now, we can combine groups in our data:
dat2 = within(dat, group <- "(2+)x(2+)")
dat2$group[dat2$parents == ""] = "parent"
dat2$group[dat2$parents == "2x2"] = "2x2"
dat2$group[dat2$parents == "2x3"] = "2x(2+)"
dat2$group[dat2$parents == "2x4"] = "2x(2+)"
dat2$group = factor(dat2$group,levels=c("parent", "2x2", "2x(2+)", "(2+)x(2+)"))
## Warning: `fun.y` is deprecated. Use `fun` instead.
## Warning: `fun.y` is deprecated. Use `fun` instead.
## Warning: `fun.y` is deprecated. Use `fun` instead.
Standard t-test comparing:
Assumptions:
Plotting the densities:
Not really normal!
summary(dat2$group)
## parent 2x2 2x(2+) (2+)x(2+)
## 181 152 174 296
x = dat2$cotyledons[dat2$group == "2x2"]
y = dat2$cotyledons[dat2$group == "(2+)x(2+)"]
Variances look similar:
var(x)
## [1] 0.3275967
var(y)
## [1] 0.4926134
But let’s do a Levene test:
leveneTest(cotyledons ~ group, dat3, center=mean)
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 1 0.2634 0.6081
## 446
The p-value is greater than 0.05, so we do not reject the null hypothesis of equal variances.
The Levene test also assumes Normality. So, we can run the alternative Fligner-Killeen test with the same conclusion (equal variances):
fligner.test(cotyledons ~ group, dat3)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: cotyledons by group
## Fligner-Killeen:med chi-squared = 0.33916, df = 1, p-value = 0.5603
Back to the t-test assumptions:
If we ignore the violation of normality and we run a standard t-test, we get:
t.test(x,y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = -6.8072, df = 363.34, p-value = 4.126e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5422314 -0.2991627
## sample estimates:
## mean of x mean of y
## 2.440789 2.861486
The p-value is less than 0.05, so we reject the null hypothesis of equality of means. This implies that we have evidence that the number of cotyledons is significantly larger in the offspring from “(2+)x(2+)” than from “2x2”.
Our justification to do a standard t-test is that if the sample size is moderately large, the two-sample t-test is robust to non-normality due to the central limit theorem.
Alternative to the t-test when normality is violated. This test does not test equality of means, but equality of distributions, so the conclusion could be deceiving (that is, two different distributions with the same mean will have a rejected null hypothesis using WMW test).
wilcox.test(cotyledons ~ group, dat3)
##
## Wilcoxon rank sum test with continuity correction
##
## data: cotyledons by group
## W = 15216, p-value = 7.461e-10
## alternative hypothesis: true location shift is not equal to 0
The p-value less than 0.05 allows us to also reject the null hypothesis of equality of distributions which provides evidence that the number of cotyledons on the “(2+)x(2+)” group is significantly higher than in the “2x2” group.
If we want to compare more than two groups, we can do so with a one-way ANOVA test. We will compare the number of cotyledons on the four groups:
Assumptions:
res.aov <- aov(cotyledons ~ group, data = dat2)
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## group 3 18.7 6.248 13.62 1.14e-08 ***
## Residuals 799 366.7 0.459
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We reject that all groups have the same mean number of cotyledons.
We can check the assumptions:
# Homogeneity of variance
plot(res.aov,1)
# Normality
plot(res.aov,2)
When normality is not met, we can use a Kruskal-Wallis rank sum test, with the same conclusion:
kruskal.test(cotyledons ~ group, data = dat2)
##
## Kruskal-Wallis rank sum test
##
## data: cotyledons by group
## Kruskal-Wallis chi-squared = 39.264, df = 3, p-value = 1.526e-08
We can move on to do Tukey multiple pairwise-comparisons (HSD test):
TukeyHSD(res.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cotyledons ~ group, data = dat2)
##
## $group
## diff lwr upr p adj
## 2x2-parent -0.2387685 -0.430637375 -0.0468997 0.0076856
## 2x(2+)-parent -0.0473741 -0.232532612 0.1377844 0.9124834
## (2+)x(2+)-parent 0.1819285 0.017371061 0.3464859 0.0234548
## 2x(2+)-2x2 0.1913944 -0.002228051 0.3850169 0.0540364
## (2+)x(2+)-2x2 0.4206970 0.246670761 0.5947233 0.0000000
## (2+)x(2+)-2x(2+) 0.2293026 0.062703782 0.3959014 0.0023572